Spatiotemporal Residual Networks for Video Action Recognition
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches.
Reviews: Spatiotemporal Residual Networks for Video Action Recognition
This paper presents a framework that improves two-stream networks for video action recognition by extending residual networks to combine information from the two streams into a single network. It significantly improves over the previous state of the art on two popular video action recognition benchmarks. The downside of this paper is its limited novelty. Previous works have tried to combine the two streams into a single network [1,2], and the temporal convolution is not new either [3]. Although the way the two streams are combined differs slightly from previous work, the proposed approach is still fairly straightforward.
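To make the combination concrete, here is a minimal, hypothetical sketch of the two ideas the review mentions: injecting one stream's features into the other via an additive residual connection, and convolving over the frame (temporal) axis. The function names and toy per-frame scalars are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (not the paper's exact architecture):
# fuse motion-stream features into the appearance stream with an
# additive residual connection, then apply a 1-D temporal convolution.

def residual_fusion(appearance, motion):
    """Inject motion features into the appearance stream additively
    (the residual-connection idea applied across streams)."""
    return [a + m for a, m in zip(appearance, motion)]

def temporal_conv(features, kernel):
    """Valid-mode 1-D convolution over the temporal (frame) axis."""
    k = len(kernel)
    return [
        sum(kernel[j] * features[t + j] for j in range(k))
        for t in range(len(features) - k + 1)
    ]

# Toy example: per-frame scalar activations from the two streams.
appearance = [1.0, 2.0, 3.0, 4.0]
motion = [0.5, 0.5, 0.5, 0.5]
fused = residual_fusion(appearance, motion)    # [1.5, 2.5, 3.5, 4.5]
out = temporal_conv(fused, [0.25, 0.5, 0.25])  # smoothed over 3 frames
```

In a real network the per-frame values would be feature maps and the temporal kernel would be learned, but the data flow is the same.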